source("utils.R")
txt <- readLines('data/joyce/1922_ulysses.txt')
print(txt[1:5])
[1] "Part One. The Telemachiad"
[2] ""
[3] "Episode 1. Telemachus"
[4] ""
[5] "Stately, plump Buck Mulligan came from the stairhead, bearing a bowl of lather on which a mirror and a razor lay crossed. A yellow dressinggown, ungirdled, was sustained gently behind him on the mild morning air. He held the bowl aloft and intoned:"
doclines <- readLines('data/joyce/1922_ulysses.txt')
splitted <- split_even(doclines, 1000)
print(splitted[1:2])
[1] "Part One. The Telemachiad Episode 1. Telemachus Stately, plump Buck Mulligan came from the stairhead, bearing a bowl of lather on which a mirror and a razor lay crossed. A yellow dressinggown, ungirdled, was sustained gently behind him on the mild morning air. He held the bowl aloft and intoned: —Introibo ad altare Dei. [... long chunk truncated ...]"
[2] "fretted his heart. Silently, in a dream she had come to him after her death, her wasted body within its loose brown graveclothes giving off an odour of wax and rosewood, her breath, that had bent upon him, mute, reproachful, a faint odour of wetted ashes. [... long chunk truncated ...]"
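`split_even()` is defined in utils.R, which is not shown here, so its exact behaviour is an assumption. A minimal sketch of a same-spirit helper (the name `split_even_sketch` and the word-count reading of the second argument are hypothetical) might look like this:

```r
# Hypothetical sketch of a split_even()-style helper (utils.R is not shown,
# so the real implementation may differ): collapse all lines into one string,
# tokenize on whitespace, and re-join the tokens into chunks of n words each.
split_even_sketch <- function(doclines, n) {
  words <- unlist(strsplit(paste(doclines, collapse = " "), "\\s+"))
  words <- words[words != ""]              # drop empties left by blank lines
  groups <- ceiling(seq_along(words) / n)  # chunk ids: 1,1,...,1,2,2,...
  unname(tapply(words, groups, paste, collapse = " "))
}
```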
# Read the files in a given directory, split them into sentences, and save the results
target_dir <- "./data/woolf"
target_files <- list.files(target_dir, "txt")
for (file in target_files) {
  doc <- readLines(file.path(target_dir, file))
  sents <- split_sentences(doclines = doc)
  sents <- unlist(sents)
  filename_1 <- gsub("(.*)(\\.txt)", "\\1_sent\\2", file)
  write(sents, file.path(target_dir, filename_1))
  even_splits <- split_even(doc, 50)
  filename_2 <- gsub("(.*)(\\.txt)", "\\1_even\\2", file)
  write(even_splits, file.path(target_dir, filename_2))
}
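`split_sentences()` also lives in utils.R and is not shown, so the following is only a sketch of what such a helper could do (the regex-based approach is an assumption, not the author's implementation):

```r
# Hypothetical sketch of a split_sentences()-style helper: join the lines
# into one string and split after sentence-final punctuation followed by
# whitespace. Real-world sentence splitting needs more care (abbreviations,
# quotes), so treat this as illustrative only.
split_sentences_sketch <- function(doclines) {
  text <- paste(doclines, collapse = " ")
  strsplit(text, "(?<=[.!?])\\s+", perl = TRUE)
}
```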
library(word2vec)
# Read the text files and lowercase everything.
# The example below reads all the works in the woolf directory at once and combines them into a single document.
woolf_files <- list.files("./data/woolf", "even.*txt", full.names = TRUE)
# lapply is used here instead of a for loop; a for loop would work just as well.
woolf_texts <- lapply(woolf_files, function(x) {
  doc <- readLines(x)
  doc <- tolower(doc)
  return(doc)
})
woolf_texts <- unlist(woolf_texts)
# set.seed(10)
# Build an embedding model from woolf_texts with the word2vec function
woolf_model <- word2vec(x = woolf_texts,
                        type = "skip-gram",
                        dim = 50,
                        window = 5,
                        negative = 5,
                        iter = 100,
                        threads = 8)
# Inspect the resulting model
str(woolf_model)
# Save the model
write.word2vec(woolf_model, "./analysis/embeddings/woolf_model.bin")
# Load the saved model back (woolf_model_loaded is used further below)
woolf_model_loaded <- read.word2vec("./analysis/embeddings/woolf_model.bin")
# Find similar words (single word)
preds <- predict(woolf_model, 'queer', type = "nearest")
print(preds)
$queer
   term1       term2 similarity rank
1  queer         odd  0.8386722    1
2  queer        kind  0.8193455    2
3  queer     strange  0.7842509    3
4  queer     chuckle  0.7828981    4
5  queer  disturbing  0.7712168    5
6  queer      sickly  0.7691061    6
7  queer       vague  0.7686276    7
8  queer frightening  0.7667588    8
9  queer     amusing  0.7642149    9
10 queer spontaneous  0.7588287   10
# Find similar words (two or more words)
preds <- predict(woolf_model, c('lady', 'gentleman'), type = "nearest")
print(preds)
$lady
term1 term2 similarity rank
1 lady queen 0.8267298 1
2 lady gentleman 0.8171982 2
3 lady friend 0.8119703 3
4 lady walpole 0.8112606 4
5 lady prince 0.8082371 5
6 lady lord 0.7913446 6
7 lady dorothy 0.7849203 7
8 lady earl 0.7843787 8
9 lady lover 0.7814237 9
10 lady daughter 0.7805042 10
$gentleman
term1 term2 similarity rank
1 gentleman servant 0.8340471 1
2 gentleman lady 0.8171982 2
3 gentleman man 0.8027343 3
4 gentleman girl 0.7983292 4
5 gentleman princess 0.7849668 5
6 gentleman woman 0.7830075 6
7 gentleman haired 0.7782508 7
8 gentleman maid 0.7748181 8
9 gentleman young 0.7730316 9
10 gentleman tweed 0.7597733 10
# Load the required libraries
library(Rtsne)
library(ggplot2)
library(ggrepel)
library(plotly)
Attaching package: ‘plotly’
The following object is masked from ‘package:ggplot2’:
    last_plot
The following object is masked from ‘package:stats’:
    filter
The following object is masked from ‘package:graphics’:
    layout
# Convert the woolf model into a matrix
woolf_embedding <- as.matrix(woolf_model_loaded)
print(dim(woolf_embedding))
cat("----------------\n")
print(woolf_embedding[1:3, 1:5])
[1] 11704 50
----------------
[,1] [,2] [,3] [,4] [,5]
brompton -0.4057100 1.256959 -1.4611098 -2.2586901 -0.9518408
trailed -0.5232229 0.356214 -1.1081522 0.7148921 -0.7097381
scope 0.3262754 1.411586 0.5579904 -0.5022139 -0.6621220
# Use Rtsne to reduce the 50 dimensions to 2, yielding an x, y matrix
dim_redu <- Rtsne(woolf_embedding, dims = 2, pca = TRUE)
viz <- dim_redu$Y
print(head(viz))
            [,1]       [,2]
[1,]  -9.0640912   2.687415
[2,]   0.7679984  11.271722
[3,]  10.9026046 -12.119877
[4,]  13.5127512  19.770924
[5,] -13.2992564 -10.646265
[6,]  -8.3744385   7.362357
# Visualize only the first 50 items in the vocabulary
plot(viz[1:50,], t = "n")
text(viz[1:50,], labels = rownames(woolf_embedding)[1:50])
# Extract words similar to 'queer'
queer_similar_words <- predict(woolf_model_loaded, 'queer', type = "nearest", top_n = 20)[[1]]$term2
print(queer_similar_words)
 [1] "odd"          "kind"         "strange"      "chuckle"      "disturbing"
 [6] "sickly"       "vague"        "frightening"  "amusing"      "spontaneous"
[11] "frightened"   "disagreeable" "curious"      "charming"     "amusement"
[16] "sad"          "terrifying"   "disliked"     "oddly"        "sinister"
# Get the row indices of 'queer' and each of its similar words
queer_id <- which(rownames(woolf_embedding) == "queer")
queer_sims_ids <- sapply(queer_similar_words, function(x) {
  which(rownames(woolf_embedding) == x)
}, USE.NAMES = FALSE)
print(c(queer_id, queer_sims_ids))
 [1]   368  2448  7496  2955  8196  3312  3781 11512  7441  8872 10932  1833
[13]  7939   704  7039  8830  8210 11096  3609  1961  6071
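The per-word `which()` lookup above can be collapsed into a single vectorized call with `match()`, which returns each word's position in the vocabulary (a minimal equivalent, shown here as a sketch):

```r
# match() returns, for each word, its index in the vocabulary vector,
# replacing the sapply()/which() combination above in one call.
lookup_ids <- function(words, vocab) match(words, vocab)
```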
# Extract the 2-D coordinates of those words by index
queer_embeddings <- viz[c(queer_id, queer_sims_ids),]
print(head(queer_embeddings))
          [,1]       [,2]
[1,] 10.307106  -4.927119
[2,] 10.214328  -4.861577
[3,] 11.589360 -13.301554
[4,] 12.701972  -1.232904
[5,] 10.220544  -4.735102
[6,]  8.919012 -10.750757
# Plot the extracted matrix
plot(queer_embeddings, t = "n")
text(queer_embeddings, labels = c("queer", queer_similar_words))
# Another plot: spread out overlapping labels
df_ <- data.frame(word = c("queer", queer_similar_words),
                  X = queer_embeddings[, 1],
                  Y = queer_embeddings[, 2])
ggplot(df_, aes(x = X, y = Y, label = word)) +
  geom_point() +
  geom_text_repel(max.overlaps = Inf) +
  labs(title = "word2vec_woolf_queer") +
  theme_minimal()
# Render as an interactive plot
plot_ly(df_, x = ~X, y = ~Y, type = "scatter", mode = "text", text = ~word)
# Save the interactive plot (as an HTML file)
library(htmlwidgets)
fig <- plot_ly(df_, x = ~X, y = ~Y, type = "scatter", mode = "text", text = ~word)
saveWidget(widget = fig,  # the plotly object
           file = "./analysis/figures/queer_embeddings.html",  # the path & file name
           selfcontained = TRUE)  # creates a single html file
# Try another dimensionality-reduction algorithm (UMAP)
library(uwot)
viz <- umap(woolf_embedding)
print(head(viz))
               [,1]       [,2]
brompton  1.0314549 -1.2062538
trailed   2.7801361  0.6954176
scope    -1.8108150  2.1450133
stripe    2.7365763  0.8246590
bullied  -1.6193800 -0.9488240
lewes    -0.9114819 -0.7766817
queer_embeddings <- viz[c("queer", queer_similar_words),]
print(head(queer_embeddings))
                 [,1]     [,2]
queer      -0.5117765 1.136436
odd        -0.4817193 1.002571
kind       -1.4693935 1.746537
strange    -0.3223265 1.053952
chuckle    -0.5028764 1.089082
disturbing -1.1016185 1.648079
df_ <- data.frame(word = rownames(queer_embeddings),
                  x = queer_embeddings[, 1],
                  y = queer_embeddings[, 2])
ggplot(df_, aes(x = x, y = y, label = word)) +
  geom_point() +
  geom_text_repel(size = 3, max.overlaps = Inf) +
  labs(title = "word2vec_queer_woolf") +
  theme_minimal()
# Render as an interactive plot
plot_ly(df_, x = ~x, y = ~y, type = "scatter", mode = "text", text = ~word)
# Read in the texts of the other three authors
author_dirs <- c("./data/lawrence", "./data/stein", "./data/joyce")
author_files <- lapply(author_dirs, function(x) list.files(x, "even.*txt", full.names = TRUE))
# Store the texts in a list
author_texts <- list()
for (author in author_files) {
  texts <- lapply(author, function(x) {
    doc <- readLines(x)
    doc <- tolower(doc)
    return(doc)
  })
  texts <- unlist(texts)
  author_texts[[length(author_texts) + 1]] <- texts
}
print(author_texts[[2]][1:3])
[1] "the good anna part i the tradesmen of bridgepoint learned to dread the sound of \"miss mathilda\", for with that name the good anna always conquered. the strictest of the one price stores found that they could give things for a little less, when" [2] "the good anna had fully said that \"miss mathilda\" could not pay so much and that she could buy it cheaper \"by lindheims.\" lindheims was anna's favorite store, for there they had bargain days, when flour and sugar were sold for a quarter of a cent less for a" [3] "pound, and there the heads of the departments were all her friends and always managed to give her the bargain prices, even on other days. anna led an arduous and troubled life. anna managed the whole little house for miss mathilda. it was a funny little house, one"
# Build a per-author embedding model from the texts
authors <- c("Lawrence", "Stein", "Joyce")  # author names, used to report progress
# Per-author embedding modelling
author_models <- list()
for (i in 1:length(author_texts)) {
  message(paste(authors[i], "processing start---"))
  author_model <- word2vec(x = author_texts[[i]],
                           type = "skip-gram",
                           dim = 50,
                           window = 5,
                           negative = 5,
                           iter = 200,
                           threads = 8)
  author_models[[authors[i]]] <- author_model
  message(paste(authors[i], "processing ended---"))
}
Lawrence processing start---
Lawrence processing ended---
Stein processing start---
Stein processing ended---
Joyce processing start---
Joyce processing ended---
# Inspect one of the resulting models
print(predict(author_models[[1]], "queer", type = "nearest"))
# Save the models
write.word2vec(author_models[[1]], "./analysis/embeddings/lawrence_embeddings.bin")
write.word2vec(author_models[[2]], "./analysis/embeddings/stein_embeddings.bin")
write.word2vec(author_models[[3]], "./analysis/embeddings/joyce_embeddings.bin")
# Load the library needed for similarity computation
library(lsa)
# Read the positive/negative opinion-word lists
positive_lex <- read.csv("./opinion-lexicon/positive-words.txt", header = FALSE, comment.char = ";")
negative_lex <- read.csv("./opinion-lexicon/negative-words.txt", header = FALSE, comment.char = ";")
print(positive_lex[1:3,])
[1] "a+" "abound" "abounds"
# Load the per-author embedding models
authors <- c("joyce", "lawrence", "stein", "woolf")
target_dir <- "./analysis/embeddings"
embedding_files <- sort(list.files(target_dir, "bin", full.names = TRUE))
author_models <- list()
for (i in 1:length(embedding_files)) {
  embed <- read.word2vec(embedding_files[i])
  author_models[[authors[i]]] <- embed
}
author_embeddings <- list()
for (i in 1:length(author_models)) {
  embed <- as.matrix(author_models[[i]])
  author_embeddings[[authors[i]]] <- embed
}
# Extract the 100 words most similar to 'queer'
authors <- c("joyce", "lawrence", "stein", "woolf")
queer_nearest_words <- list()
for (i in 1:length(author_embeddings)) {
  nearest_words <- predict(author_models[[i]], newdata = "queer", type = "nearest", top_n = 100)
  queer_nearest_words[[authors[i]]] <- nearest_words[[1]]
}
# For each author, find the positive and negative words among the queer-similar words
joyce_pos_words <- queer_nearest_words[['joyce']]$term2[queer_nearest_words[['joyce']]$term2 %in% positive_lex$V1]
joyce_neg_words <- queer_nearest_words[['joyce']]$term2[queer_nearest_words[['joyce']]$term2 %in% negative_lex$V1]
lawrence_pos_words <- queer_nearest_words[['lawrence']]$term2[queer_nearest_words[['lawrence']]$term2 %in% positive_lex$V1]
lawrence_neg_words <- queer_nearest_words[['lawrence']]$term2[queer_nearest_words[['lawrence']]$term2 %in% negative_lex$V1]
stein_pos_words <- queer_nearest_words[['stein']]$term2[queer_nearest_words[['stein']]$term2 %in% positive_lex$V1]
stein_neg_words <- queer_nearest_words[['stein']]$term2[queer_nearest_words[['stein']]$term2 %in% negative_lex$V1]
woolf_pos_words <- queer_nearest_words[['woolf']]$term2[queer_nearest_words[['woolf']]$term2 %in% positive_lex$V1]
woolf_neg_words <- queer_nearest_words[['woolf']]$term2[queer_nearest_words[['woolf']]$term2 %in% negative_lex$V1]
print(woolf_pos_words)
cat("-----------------\n")
print(woolf_neg_words)
 [1] "amusing"         "spontaneous"     "charming"        "astonishingly"
 [5] "attractive"      "smile"           "pretty"          "awe"
 [9] "wonderful"       "astonishing"     "magnificent"     "amiable"
[13] "nice"            "sharp"           "modest"          "sensitive"
[17] "bright"          "romantic"        "humble"          "prominent"
[21] "extraordinarily"
-----------------
 [1] "odd"          "strange"      "disturbing"   "sickly"       "vague"
 [6] "frightening"  "disagreeable" "sad"          "disliked"     "oddly"
[11] "sinister"     "solicitude"   "painful"      "awfully"      "glum"
[16] "ominous"      "unpleasant"   "distasteful"  "sly"          "incongruous"
[21] "distaste"     "suspiciously" "mystery"      "pathetic"     "unnecessary"
[26] "oddest"       "unusual"      "melancholy"   "flimsy"       "alarming"
[31] "boredom"      "object"
# Count the positive and negative words per author
print(paste("The number of positive words in Joyce is", length(joyce_pos_words)))
print(paste("The number of negative words in Joyce is", length(joyce_neg_words)))
print(paste("The number of positive words in Lawrence is", length(lawrence_pos_words)))
print(paste("The number of negative words in Lawrence is", length(lawrence_neg_words)))
print(paste("The number of positive words in Stein is", length(stein_pos_words)))
print(paste("The number of negative words in Stein is", length(stein_neg_words)))
print(paste("The number of positive words in Woolf is", length(woolf_pos_words)))
print(paste("The number of negative words in Woolf is", length(woolf_neg_words)))
[1] "The number of positive words in Joyce is 8"
[1] "The number of negative words in Joyce is 16"
[1] "The number of positive words in Lawrence is 9"
[1] "The number of negative words in Lawrence is 32"
[1] "The number of positive words in Stein is 9"
[1] "The number of negative words in Stein is 29"
[1] "The number of positive words in Woolf is 21"
[1] "The number of negative words in Woolf is 32"
# Positive/negative word ratios per author (out of 100)
print(paste("The ratio of positive words in Joyce is", length(joyce_pos_words)/100))
print(paste("The ratio of negative words in Joyce is", length(joyce_neg_words)/100))
print(paste("The ratio of positive words in Lawrence is", length(lawrence_pos_words)/100))
print(paste("The ratio of negative words in Lawrence is", length(lawrence_neg_words)/100))
print(paste("The ratio of positive words in Stein is", length(stein_pos_words)/100))
print(paste("The ratio of negative words in Stein is", length(stein_neg_words)/100))
print(paste("The ratio of positive words in Woolf is", length(woolf_pos_words)/100))
print(paste("The ratio of negative words in Woolf is", length(woolf_neg_words)/100))
[1] "The ratio of positive words in Joyce is 0.08"
[1] "The ratio of negative words in Joyce is 0.16"
[1] "The ratio of positive words in Lawrence is 0.09"
[1] "The ratio of negative words in Lawrence is 0.32"
[1] "The ratio of positive words in Stein is 0.09"
[1] "The ratio of negative words in Stein is 0.29"
[1] "The ratio of positive words in Woolf is 0.21"
[1] "The ratio of negative words in Woolf is 0.32"
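The eight `print()` calls above repeat one computation per author. The same counts and ratios can be collected into a single data frame with one helper (a sketch; `sentiment_words` is a hypothetical named list, one entry per author, holding the positive and negative matches among that author's 100 nearest words):

```r
# Sketch: collect positive/negative ratios for all authors in one table.
# Each list entry is assumed to have $pos and $neg character vectors,
# drawn from the top_n = 100 neighbours used above.
ratio_table <- function(sentiment_words, top_n = 100) {
  data.frame(author   = names(sentiment_words),
             positive = sapply(sentiment_words, function(x) length(x$pos) / top_n),
             negative = sapply(sentiment_words, function(x) length(x$neg) / top_n),
             row.names = NULL)
}
```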
# Convert the woolf model into an embedding matrix, for vector-similarity computations
woolf_embeddings <- as.matrix(author_embeddings[['woolf']])
# Find the words most similar to a given word; predict() can return the query word's embedding.
# First get the embedding vector for 'love' from the woolf model,
wv <- predict(author_models[['woolf']], newdata = "love", type = "embedding")
val <- wv[1,]
# or pull the vector straight out of the embedding matrix:
# val <- author_embeddings[['woolf']]['love',]
# Then compare this vector against every other vector; apply() keeps this fast.
res <- apply(woolf_embeddings, 1, function(rw) cosine(x = val, y = rw))  # compare val against the whole embedding matrix
print(sort(res, decreasing = TRUE)[1:10])
        love      passion    hypocrisy         envy   friendship      dislike
   1.0000000    0.6378813    0.6300378    0.6239195    0.6081415    0.5962149
passionately        death         lust        ‘life
   0.5943222    0.5912219    0.5841448    0.5809732
# Combine two words' embedding vectors (matrix arithmetic).
wv <- predict(author_models[['woolf']], newdata = c("love", "hate"), type = "embedding")
# Add the two embedding vectors, i.e. build a combined 'love' + 'hate' vector.
val <- wv['love',] + wv['hate',]
# print(val)
# Find the words whose vectors lie closest to the combined vector.
# Vector similarity is measured with cosine similarity (the cosine function).
res <- apply(woolf_embeddings, 1, function(rw) cosine(x = val, y = rw))  # compare val against the whole embedding matrix
print(sort(res, decreasing = TRUE)[1:10])
love hate vanities envy suffer hypocrisy
0.8849086 0.8849086 0.6769011 0.6766069 0.6338068 0.6229489
hatred savagery jealousy interfering
0.6086594 0.5857570 0.5846844 0.5801688
# Compute the similarity between the queer vector and the positive/negative words among its neighbours.
woolf_pos_sims <- apply(author_embeddings[['woolf']][woolf_pos_words,], 1, function(rw) cosine(x = woolf_embeddings['queer',], y = rw))
woolf_neg_sims <- apply(author_embeddings[['woolf']][woolf_neg_words,], 1, function(rw) cosine(x = woolf_embeddings['queer',], y = rw))
# Distance is the complement of similarity, i.e. 1 - similarity.
woolf_pos_dists <- 1 - woolf_pos_sims
woolf_neg_dists <- 1 - woolf_neg_sims
# Average the positive/negative distances, then take the difference between negative and positive.
woolf_positivity_queer <- mean(woolf_neg_dists) - mean(woolf_pos_dists)
# woolf_positivity_queer <- sum(woolf_pos_sims) - sum(woolf_neg_sims)
print(woolf_positivity_queer)
[1] -0.02003773
# Inspect the distances between queer and the positive/negative words.
print(woolf_pos_dists)
cat('--------------\n')
print(woolf_neg_dists)
amusing spontaneous charming astonishingly attractive
0.4159755 0.4241790 0.4355073 0.4801285 0.4877095
smile pretty awe wonderful astonishing
0.4945430 0.4945625 0.4975180 0.5012383 0.5012868
magnificent amiable nice sharp modest
0.5096653 0.5102666 0.5120216 0.5221945 0.5297301
sensitive bright romantic humble prominent
0.5301137 0.5310545 0.5336006 0.5353009 0.5371554
extraordinarily
0.5404406
--------------
odd strange disturbing sickly vague frightening
0.2966291 0.3849504 0.4052247 0.4084758 0.4092115 0.4120809
disagreeable sad disliked oddly sinister solicitude
0.4322023 0.4518979 0.4589085 0.4614068 0.4628387 0.4719150
painful awfully glum ominous unpleasant distasteful
0.4792584 0.4887541 0.4982792 0.5020226 0.5032926 0.5078179
sly incongruous distaste suspiciously mystery pathetic
0.5097191 0.5099150 0.5123862 0.5210352 0.5220025 0.5220147
unnecessary oddest unusual melancholy flimsy alarming
0.5220989 0.5289497 0.5306460 0.5335118 0.5343147 0.5363234
boredom object
0.5375488 0.5400248
# Apply the same procedure to the other three authors.
# Build each author's embedding matrix.
joyce_embeddings <- as.matrix(author_embeddings[['joyce']])
lawrence_embeddings <- as.matrix(author_embeddings[['lawrence']])
stein_embeddings <- as.matrix(author_embeddings[['stein']])
# joyce
joyce_pos_sims <- apply(joyce_embeddings[joyce_pos_words,], 1, function(rw) cosine(x = joyce_embeddings['queer',], y = rw))
joyce_neg_sims <- apply(joyce_embeddings[joyce_neg_words,], 1, function(rw) cosine(x = joyce_embeddings['queer',], y = rw))
joyce_pos_dists <- 1 - joyce_pos_sims
joyce_neg_dists <- 1 - joyce_neg_sims
joyce_positivity_queer <- mean(joyce_neg_dists) - mean(joyce_pos_dists)
# joyce_positivity_queer <- sum(joyce_pos_sims) - sum(joyce_neg_sims)
print(joyce_positivity_queer)
[1] 0.004759233
print(joyce_pos_dists)
cat('--------------\n')
print(joyce_neg_dists)
excited lovely awe decent satisfy nice thrilled nicer
0.4210885 0.4580258 0.4953935 0.5298472 0.5305891 0.5340911 0.5358230 0.5498801
--------------
smell stale coarse strange tired struggling terrible
0.3861283 0.4546362 0.4965594 0.4984882 0.4993610 0.5057617 0.5064134
bother foul smelt decay stifling wild suck
0.5174079 0.5212522 0.5279060 0.5398718 0.5412109 0.5429565 0.5442462
worst irritation
0.5513480 0.5520768
# lawrence
lawrence_pos_sims <- apply(lawrence_embeddings[lawrence_pos_words,], 1, function(rw) cosine(x = lawrence_embeddings['queer',], y = rw))
lawrence_neg_sims <- apply(lawrence_embeddings[lawrence_neg_words,], 1, function(rw) cosine(x = lawrence_embeddings['queer',], y = rw))
lawrence_pos_dists <- 1 - lawrence_pos_sims
lawrence_neg_dists <- 1 - lawrence_neg_sims
lawrence_positivity_queer <- mean(lawrence_neg_dists) - mean(lawrence_pos_dists)
# lawrence_positivity_queer <- sum(lawrence_pos_sims) - sum(lawrence_neg_sims)
print(lawrence_positivity_queer)
[1] 0.03514807
print(lawrence_pos_dists)
cat('--------------\n')
print(lawrence_neg_dists)
prominent sharp grin like gentle
0.3306781 0.3568192 0.3839144 0.4404632 0.4567575
smile humorous rapt extraordinary
0.4769002 0.4932401 0.4941408 0.5149683
--------------
odd savage wicked sinister absurd
0.3695543 0.4073070 0.4165786 0.4202439 0.4243594
peculiar strange wild oddest melancholy
0.4427576 0.4464482 0.4465072 0.4520501 0.4552310
defiant pathetic faint unnatural funny
0.4570941 0.4587301 0.4641097 0.4721134 0.4751702
oddly crumpled wrinkled blunt jeering
0.4793590 0.4841202 0.4863810 0.4913699 0.4931961
dangerous blind haggard domineering devilish
0.4935949 0.4985599 0.4993771 0.5047609 0.5068476
sneer shabby mocking vicious irresponsible
0.5087988 0.5123263 0.5125555 0.5178180 0.5196422
terrible incomprehension
0.5215478 0.5231417
# stein
stein_pos_sims <- apply(stein_embeddings[stein_pos_words,], 1, function(rw) cosine(x = stein_embeddings['queer',], y = rw))
stein_neg_sims <- apply(stein_embeddings[stein_neg_words,], 1, function(rw) cosine(x = stein_embeddings['queer',], y = rw))
stein_pos_dists <- 1 - stein_pos_sims
stein_neg_dists <- 1 - stein_neg_sims
stein_positivity_queer <- mean(stein_neg_dists) - mean(stein_pos_dists)
# stein_positivity_queer <- sum(stein_pos_sims) - sum(stein_neg_sims)
print(stein_positivity_queer)
[1] -0.0171236
print(stein_pos_dists)
cat('--------------\n')
print(stein_neg_dists)
nice humor marvellous wonderful lover
0.4324507 0.4934363 0.5147753 0.5262983 0.5465836
cherished distinguished fine fun
0.5541693 0.5567429 0.5592445 0.5655816
--------------
strange poor brutal nasty ugly
0.4059059 0.4487226 0.4508060 0.4548882 0.4678780
funny flighty uncertain badly sad
0.4783487 0.4828184 0.4830050 0.4918080 0.5051616
bad angry ashamed disgusted strained
0.5085252 0.5085407 0.5138997 0.5145495 0.5155280
hard fat shame uncomfortable poison
0.5190932 0.5224168 0.5230778 0.5265052 0.5301953
hurt irritable stubborn broken pale
0.5316128 0.5339198 0.5407190 0.5536029 0.5546054
expensive difficult death unhappy
0.5564286 0.5569840 0.5630232 0.5640897
# Try pretrained GloVe embedding vectors
pretrained_embeddings <- read.table("./embeddings/pretrained/glove6b/glove.6B.300d.csv", sep = ",", header = FALSE, row.names = 1)
pretrained_embeddings <- as.matrix(pretrained_embeddings)
print(dim(pretrained_embeddings))
[1] 400000 300
# Working with the word embeddings
# val <- pretrained_embeddings['england',]
val <- pretrained_embeddings['king',] - pretrained_embeddings['man',] + pretrained_embeddings['woman',]
# val <- pretrained_embeddings['guy',] - pretrained_embeddings['he',] + pretrained_embeddings['she',]
# print(val)
res <- apply(pretrained_embeddings, 1, function(rw) lsa::cosine(x = val, y = rw))  # compare val against the whole embedding matrix
print(sort(res, decreasing = TRUE)[1:20])
     king     queen   monarch    throne  princess    mother  daughter   kingdom
0.8065858 0.6896163 0.5575491 0.5565375 0.5518684 0.5142154 0.5133157 0.5025345
   prince elizabeth      wife     crown     woman       her     royal     marry
0.5017740 0.4908031 0.4840559 0.4728340 0.4675374 0.4504166 0.4489115 0.4381888
  married    sister   husband        ii
0.4308436 0.4290439 0.4238651 0.4199441
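Note that in the output above the query word 'king' itself still ranks first, since nothing removes the input words from the candidates. A common refinement is to drop them before reporting neighbours (a sketch over a named similarity vector like `res`; the function name is hypothetical):

```r
# Sketch: report the top-n neighbours after excluding the analogy's own
# query words from a named similarity vector.
top_neighbours <- function(sims, exclude, n = 10) {
  sims <- sims[!names(sims) %in% exclude]
  head(sort(sims, decreasing = TRUE), n)
}
```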
# Expanding search terms (1)
fish <- pretrained_embeddings['fish',]
fishy <- sort(apply(pretrained_embeddings, 1, function(rw) lsa::cosine(x = fish, y = rw)), decreasing = TRUE)
fishy <- names(fishy)[1:10]
print(fishy)
# Expanding search terms (2): search again with the mean vector of the expanded terms
comb_vals <- pretrained_embeddings[c("fish","salmon","tuna", "shrimp", "trout"),]
comb_vals <- apply(comb_vals, 2, mean)
expanded_fishy <- apply(pretrained_embeddings, 1, function(rw) lsa::cosine(x = comb_vals, y = rw))  # compare the mean vector against the whole embedding matrix
print(sort(expanded_fishy, decreasing = TRUE)[1:50])